1 INTRODUCTION

Real estate is an asset that many people want to invest in: it not only brings financial returns but also serves long-term sustainability goals. Understanding market trends and property values helps house sellers and buyers make optimal purchase decisions while finding homes that genuinely suit their needs.

However, the dynamics of the housing market are influenced by many factors. This study examines four important aspects: the property’s age, the size of the house, market price trends, and a unique premium feature, the waterfront. Specifically, we test four main questions: (1) the market preference for house size, by examining the median; (2) the market preference for the premium waterfront feature, by examining its proportion among transactions; (3) the association between a house’s age and its condition; and (4) the price difference between renovated houses and newly constructed houses. By addressing these questions, we aim to give stakeholders (sellers, buyers, investors, and brokers) a comprehensive picture of US property market trends at a specific point in time to inform their housing decisions.

2 GENERAL DATA PACKAGE AND DATASET

2.1 Dataset

We will be using the dataset “US House Price”. This dataset records house transactions in the Seattle Metropolitan Area in the USA from May to July 2014. The variables we will test are:

  1. Price: The property’s sale price in USD serves as our target variable. This is the primary outcome we aim to predict and analyze in relation to other features.
  2. Sqft_Living: The size of the living area in square feet usually reflects property value. Larger living spaces are generally associated with higher prices, allowing us to assess the relationship between size and value.
  3. Waterfront: A binary indicator showing whether the property has a waterfront view (1 for yes, 0 for no). Waterfront properties often enjoy higher valuations due to their desirability.
  4. Condition: Rated from 1 to 5, this index reflects the property’s overall condition. Well-maintained properties typically command higher prices.
  5. Yr_Built: The year the property was constructed can influence market value, as older properties may have historical significance while newer ones offer modern amenities.
  6. Yr_Renovated: The year of the last renovation provides insight into upkeep and appeal. Recent renovations can significantly enhance value.
  7. City: The city where the property is located. Different cities within the Seattle Metro area exhibit distinct market dynamics.

2.2 Data packages

  1. tidyverse: A collection of packages for data manipulation, visualization, and analysis.
  2. readxl: Used for reading Excel files.
  3. infer: A package for statistical inference.
  4. DT: A package for creating interactive data tables.
  5. dplyr: A core package in the tidyverse for data manipulation.
  6. ggplot2 and plotly: Two packages for data visualization; we use plotly to make the plots interactive.
  7. janitor: A package used to clean the data.
library(readxl)
library(dplyr)
library(tidyverse)
library(infer)
library(DT)
library(ggplot2)
library(plotly)
library(janitor)

2.3 Import file and process the general data

We import our file in use:

master_housing <- read_csv("USA Housing Dataset (1).csv")
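Section 2.2 loads janitor “to clean the data”, and the analysis below refers to lower-case snake_case columns such as sqft_living. A minimal cleaning sketch (an assumption on our part: the raw CSV headers may be capitalized as in Section 2.1) could look like:

```r
library(janitor)  # clean_names()
library(dplyr)    # %>% and glimpse()

# Standardize headers to snake_case, e.g. "Sqft_Living" -> "sqft_living"
# (assumption: the raw headers need standardizing; if they are already
# lower-case, clean_names() simply leaves them unchanged)
master_housing <- master_housing %>%
  clean_names()

glimpse(master_housing)  # inspect the cleaned column names and types
```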

3 HYPOTHESIS

3.1 Hypothesis 1 (Inference for a numerical variable): The median house size of municipal cities in the Seattle Metropolitan Area is equal to 2,185 square feet.

3.1.1 Motivation

There has been discussion over whether prospective buyers should choose large houses or smaller homes. While a spacious house can offer more room for welcoming guests, hosting bonding activities, or meeting additional needs, affordability, disconnection, and insufficient resources are drawbacks that hold house hunters back from purchase decisions (Shalom, 2018). As a result, budget, preferences, and needs critically determine housing choices (SuperAdmin, 2022).

These considerations are relevant to the Seattle metropolitan area, where house sizes are reported to have been decreasing for years. A significant contributing factor is the rise of townhouse construction: as of 2000, 40% of homes in the region were townhouses, and ongoing shifts in development patterns have reduced the average lot size by 30% over the past two decades (Gatea, 2021). This has further increased house prices, which were already high due to the strong economy (technology, healthcare, and maritime industries), prestigious education, the supply-demand imbalance, etc. (Fox et al., 2024).

Given the continuous decline in house sizes, it is reasonable to hypothesize that the median house size in the Seattle Metropolitan Area may differ from that of its state, Washington. The Washington median of 2,185 sq. ft in 2022 (NeoMam Studios, 2022) serves as a valuable point of comparison. Since Seattle is the state’s largest metropolitan hub, its housing patterns could reflect broader state trends. However, given the city’s unique housing pressures, such as higher land prices, increased demand for urban living, and the rise of compact townhouses, Seattle’s house sizes may deviate from the state median; in particular, its growing economic and service development, which drives house prices up, suggests that its houses may be smaller than the state median. By testing whether the median house size in the Seattle Metropolitan Area aligns with the state median, we can assess whether Seattle’s housing market follows statewide norms or exhibits distinct characteristics. This distinction offers valuable insight into urban development, housing affordability, and the spatial efficiency of homes in metropolitan areas compared to more suburban or rural parts of the state.

3.1.2 Hypothesis statement

  • Null hypothesis (\(H_0\)): The median house size of municipal cities in the Seattle Metropolitan Area is equal to 2,185 square feet.
    • In symbols: \(H_0: M = 2185\), where \(M\) denotes the population median
  • Alternative hypothesis (\(H_a\)): The median house size of municipal cities in the Seattle Metropolitan Area is less than 2,185 square feet.
    • In symbols: \(H_a: M < 2185\)
  • Significance level: \(\alpha = 0.05\)
null_hypothesis_1 <- master_housing %>%
    specify(response = sqft_living) %>%
    hypothesize(null = "point", med = 2185)

3.1.3 Data processing

3.1.3.1 Step 1. State the test statistics and testing method

  • Test statistic: The sample median

The sample median was selected over the mean to avoid the influence of outliers, which matters for datasets containing extreme values. Because of its low sensitivity to outliers, the median provides a more faithful representation of the central tendency for such data.

  • Testing method: Bootstrap sampling

To assess statistical significance, we chose the bootstrap sampling method. Since this method does not require the data to follow a known, specific distribution, we can generate an empirical distribution of the median and assess the significance of the observed sample median.
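As a quick toy illustration of the median’s robustness (the numbers are illustrative, not drawn from the dataset):

```r
# One extreme value pulls the mean far from the bulk of the data,
# while the median stays at the middle observation.
sizes <- c(1500, 1800, 2000, 2100, 12000)

mean(sizes)    # 3880 -- inflated by the single 12,000 sqft outlier
median(sizes)  # 2000 -- unaffected by the outlier
```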

3.1.3.2 Step 2. Graph the null distribution

Using the bootstrap method, we resampled the dataset 1,000 times with replacement, generating 1,000 median values. Then, assuming 2,185 is the true median, we shifted the distribution so that 2,185 square feet became its central value. Because the spread remains intact during this shift, we can test whether the observed value is likely to occur under this null scenario.

null_distribution_1 <- null_hypothesis_1 %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "median")
null_distribution_1
## Response: sqft_living (numeric)
## Null Hypothesis: point
## # A tibble: 1,000 × 2
##    replicate  stat
##        <int> <dbl>
##  1         1  2195
##  2         2  2195
##  3         3  2175
##  4         4  2189
##  5         5  2185
##  6         6  2187
##  7         7  2175
##  8         8  2165
##  9         9  2215
## 10        10  2185
## # ℹ 990 more rows
# Create a ggplot histogram
null_graph_1 <- ggplot(null_distribution_1, aes(x = stat)) +
  geom_histogram(binwidth = 10, fill = "#81bfda", color = "black", alpha = 0.7) +
  labs(title = "Bootstrap Distribution of Median Square Footage",
       x = "Median Square Footage (sqft)",
       y = "Frequency") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 18, color = "dark blue", face = "bold", family = "serif"))

# Convert ggplot to plotly for interactivity
ggplotly(null_graph_1)

Figure 3.1.3.2. The bootstrap distribution of median square footage.

3.1.3.3 Step 3. Calculate the observed statistics

We computed the observed median house size in the area from the column sqft_living (the living area in square feet). The observed median was 1,980 sq. ft.

observed_hypothesis_1 <- master_housing %>%
  specify(response = sqft_living) %>%
  calculate(stat = "median")
observed_hypothesis_1
## Response: sqft_living (numeric)
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  1980

3.1.3.4 Step 4. Calculate the p-value

To derive the p-value, we compared the observed statistic to the null distribution. The p-value is the probability, under the null distribution, of obtaining a value equal to or more extreme than the observed statistic; in this case, \(p_{value} = \frac{N(\text{median}^{*} \leq 1980)}{N}\), where \(\text{median}^{*}\) denotes a bootstrap median from the null distribution and \(N = 1000\) (here, \(p = 0\)).

# Calculate p-value
p_value_1 <- null_distribution_1 %>%
  get_p_value(obs_stat = observed_hypothesis_1, direction = "less")

# Show the p value in 4 decimals
p_value_1 <- round(p_value_1$p_value, 4)
p_value_1
## [1] 0

Then, we add up the visualization of the observed statistic to the null distribution:

# Visualize the null distribution with shaded p-value and additional customization
ggplotly(null_distribution_1 %>%
  visualize() +
  shade_p_value(obs_stat = observed_hypothesis_1, direction = "less") +
  geom_vline(aes(xintercept = observed_hypothesis_1$stat), color = "darkred", linetype = "dashed", size = 1) +
  labs(title = "Null Distribution of Median Square Footage with Observed Statistic",
       x = "Median Square Footage (sqft)",
       y = "Frequency") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold", family = "Times", color = "darkred"),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10)) +
  scale_fill_manual(values = c("blue")))

Figure 3.1.3.4. Null Distribution of Median Square Footage with Observed Statistic.

3.1.3.5 Step 5. Interpretation - Statistical Conclusion

Since \(p_{value} < 0.05\), we rejected the null hypothesis that the median house size of municipal cities in the Seattle Metropolitan Area is equal to 2,185 square feet. Under the null hypothesis, the probability of observing a median as small as the observed 1,980 square feet is extremely low; therefore, it is unlikely that the null hypothesis is true.

# Set the significance level
alpha <- 0.05

# Test conclusion
if (p_value_1 < alpha) {
  conclusion <- "Reject the null hypothesis: The median house size of municipal cities in the Seattle Metropolitan Area is significantly fewer than 2,185 square feet."
} else {
  conclusion <- "Fail to reject the null hypothesis: There is not enough evidence to conclude that the median house size of municipal cities in the Seattle Metropolitan Area is significantly fewer than 2,185 square feet."
}

# Display the conclusion
conclusion
## [1] "Reject the null hypothesis: The median house size of municipal cities in the Seattle Metropolitan Area is significantly fewer than 2,185 square feet."

3.1.3.6 Step 6: Further test with bootstrap confidence interval of values of test statistics

To consolidate the conclusion and determine the range of plausible values for the median size, we computed a bootstrap confidence interval. Using the same bootstrapping method as in Step 2 (but without shifting the distribution), we obtained the actual bootstrap distribution. From this distribution we took the \(95\%\) confidence interval, between the \(2.5\%\) and \(97.5\%\) quantiles (excluding the \(5\%\) most extreme values). The \(95\%\) confidence interval for the median house size ranged from 1,950 to 2,010 square feet, entirely below 2,185. Therefore, we have statistical evidence to support the alternative hypothesis that the median house size in the Seattle Metropolitan Area was less than 2,185 square feet.

# Calculate 95% confidence interval
boot_distn_one_median <- master_housing %>%
  specify(response = sqft_living) %>%
  generate(reps = 10000, type = "bootstrap") %>% 
  calculate(stat = "median")

ci <- boot_distn_one_median %>% 
  get_ci()
ci
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     1950     2010
# Visualize the bootstrap distribution with confidence interval and make it interactive
ggplot(boot_distn_one_median, aes(x = stat)) +
  geom_histogram(binwidth = 10, fill = "#81bfda", color = "black", alpha = 0.7) +
  geom_vline(aes(xintercept = ci$lower_ci), color = "#E38E49", linetype = "solid", size = 2) +
  geom_vline(aes(xintercept = ci$upper_ci), color = "#E38E49", linetype = "solid", size = 2) +
  geom_text(aes(x = ci$lower_ci, y = Inf, label = "Lower CI"), color = "#E38E49", vjust = -0.5, hjust = 1.1) +
  geom_text(aes(x = ci$upper_ci, y = Inf, label = "Upper CI"), color = "#E38E49", vjust = -0.5, hjust = -0.1) +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Bootstrap Distribution of Median Square Footage\nwith Confidence Interval",
       x = "Median Square Footage",
       y = "Frequency", size = 14) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold", family = "Times", color = "darkblue"),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12),
        panel.grid.major = element_line(color = "gray80"),
        panel.grid.minor = element_line(color = "gray90")) +
  annotate("rect", xmin = ci$lower_ci, xmax = ci$upper_ci, ymin = -Inf, ymax = Inf, alpha = 0.3, fill = "#E38E49")

Figure 3.1.3.6. Bootstrap Distribution of Median Square Footage with Confidence Interval.

3.1.4 Result and Discussion

3.1.4.1 Result

Based on our analysis, we have strong statistical evidence that the median house size in cities within the Seattle Metropolitan Area is smaller than 2,185 square feet. This conclusion is supported by two pieces of evidence:

  • The p-value is 0, so we rejected the null hypothesis: the median house size of municipal cities in the Seattle Metropolitan Area was significantly less than 2,185 square feet.
  • Confidence interval: the \(95\%\) confidence interval for the median house size ranges from 1,950 to 2,010 square feet, entirely below 2,185 square feet. Therefore, we have statistical evidence supporting the alternative hypothesis that the median house size was below 2,185 square feet.

3.1.4.2 Discussion

Insights on house size trends in the Seattle Metropolitan Area include, but are not limited to, the following:

1. Deviation from the general state assumption:

The analysis reveals a smaller median house size in the Seattle Metropolitan Area than in Washington state overall. This conclusion challenges the assumption that local median sizes align with state-level figures.

2. Possible explanations for smaller house sizes in the area (besides increasing townhouse model):

  • Population growth and workforce demand: The workforce has been growing because of the presence of attractive high-tech companies. This population increase drives the need for smaller, more affordable options to accommodate the influx of residents.

  • Dynamics in living arrangements: The rise of new housing types, such as apartments, multi-family units, and multi-generational living, reduces house sizes in this region.

Having highlighted customers’ housing behavior and trends, we suggest that businesses and policymakers pay closer attention to future housing development, payment planning, and market strategies.

3.1.4.3 Limitations

1. Data relevance

The benchmark for comparison was extracted from 2022 data, whereas our dataset covers Quarter 2, 2014. This time gap can reduce the relevance of the comparison due to potential changes in housing trends over time.

2. Data Representation

The sample size we used was small, potentially introducing bias and limiting the generality of the results. To reliably reflect housing trends, a larger sample would be required.

3. The use of the median as a measure

The test cannot capture the distributional characteristics of house size (because the median discards information carried by outliers). Therefore, we could not analyze the dataset’s other statistical features comprehensively.

3.1.4.4 Suggestions

3.1.4.4.1 For Buyers

Buyers should adjust their size expectations for the area, because its houses are generally smaller than typical houses in Washington state. If options are too few, they should consider diversifying their choices of house types and neighborhoods, or looking at alternative areas. By adopting a more flexible approach, they can make better decisions than by insisting on unmatched options in the Seattle Metropolitan Area.

3.1.4.4.2 For Salesmen

Salesmen should strategize their pricing and selling direction more carefully. Suggestions include positioning larger houses for key target audiences (where most houses are small, a larger house stands out) or maximizing small houses’ functionality (giving a listing a unique selling point against similar houses).

For example, they can list large properties as a premium offer for buyers who prefer spacious living, or highlight a house’s key features to strengthen its market competitiveness.

3.2 Hypothesis II (Inference for a categorical variable): The proportion of transactions of houses that have waterfront to the overall number of transactions is equal to 0.6%

3.2.1 Motivation

Waterfront properties are widely regarded as premium real estate because of their higher market values and unique lifestyle benefits, such as scenic views, privacy, and rental income potential. However, because of their high value, unique locations, and limited availability, these houses are always in high demand while supply remains limited (Amres, 2024).

To understand the availability of this kind of house, we would like to know the proportion of waterfront houses among all house transactions. An article states: “This (waterfront houses) is a rare kind of home. In a given year, about 0.4 percent to 0.6 percent of all property transactions are for houses on the water.” (Forbes, 2018). To validate this claim, we perform a hypothesis test of the statement “the proportion of waterfront houses makes up \(0.6\%\) of all house transactions” on the dataset of US house transactions from May to July 2014.

3.2.2 Hypothesis statement

  • Null hypothesis (\(H_0\)): The proportion of transactions of houses that have waterfront to the overall number of transactions is equal to \(0.6\%\).
    • In symbol: \(H_0: p = 0.006\)
  • Alternative hypothesis (\(H_a\)): The proportion of transactions of houses that have waterfront to the overall number of transactions is greater than \(0.6\%\).
    • In symbol: \(H_a: p > 0.006\)
  • Significance level: \(\alpha = 0.05\)
# Convert the waterfront column to a factor with levels "0" and "1"
master_housing$waterfront <- factor(master_housing$waterfront, levels = c("0", "1"))
# Specify the null hypothesis
null_hypothesis_2 <- master_housing %>%
  specify(response = waterfront, success = "1") %>%
  hypothesize(null = "point", p = 0.006)

3.2.3 Data processing

3.2.3.1 Step 1. State the test statistics and testing method

In the dataset “USA House Price”, we use the variable waterfront, which is described as follows:

“ A binary indicator showing whether the property has a waterfront view (1 for yes, 0 for no). Waterfront properties often enjoy higher valuations due to their desirability.”

  • Test statistic: The sample proportion (the ratio between the number of waterfront house transactions and the total number of house transactions in the dataset)
  • Testing method: One-proportion test (since we compare the sample proportion against a single hypothesized value)

3.2.3.2 Step 2. Graph the null distribution

  • The null distribution represents the distribution of the sample proportion of waterfront houses under the assumption that the null hypothesis is true. In this case, the null hypothesis states that the true proportion of waterfront houses is \(0.6\%\) (or 0.006).
  • To obtain this distribution, we ran a simulation generating 1,000 samples from the population under the assumption that the null hypothesis is true. For each sample, we calculated the proportion of waterfront houses; we then tabulated the results and drew a histogram to visualize the distribution.
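Formally, letting \(n\) be the number of transactions in the dataset and \(X\) the number of waterfront houses in one simulated sample, each replicate’s statistic is one draw of the sample proportion under the null model:

\[
X \sim \mathrm{Binomial}(n,\ 0.006), \qquad \hat{p} = \frac{X}{n}.
\]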
# Define the null hypothesis proportion
# (stored under a new name so it does not overwrite the infer object
# null_hypothesis_2 created above)
null_prop_2 <- 0.006

# Generate the null distribution
null_distribution_2 <- master_housing %>%
  specify(response = waterfront, success = "1") %>%
  hypothesize(null = "point", p = null_prop_2) %>%
  generate(reps = 1000, type = "draw") %>%
  calculate(stat = "prop")

head(null_distribution_2)
## Response: waterfront (factor)
## Null Hypothesis: point
## # A tibble: 6 × 2
##   replicate    stat
##       <int>   <dbl>
## 1         1 0.00556
## 2         2 0.00362
## 3         3 0.00652
## 4         4 0.00531
## 5         5 0.00797
## 6         6 0.00604
ggplotly(ggplot(null_distribution_2, aes(x = stat)) +
  geom_histogram(binwidth = 0.001, fill = "#81bfda", color = "black", alpha = 0.7) +
  labs(title = "Histogram of the Null Distribution",
       x = "Proportion of Waterfront Houses",
       y = "Frequency") +
  theme_minimal() + theme(plot.title = element_text(hjust = 0.5, size = 18, color = "dark blue", face = "bold", family = "serif")))

Figure 3.2.3.2. Histogram of the Null Distribution (given the null hypothesis of hypothesis 2 is true).

3.2.3.3 Step 3. Calculate the observed data

The observed statistic is the value calculated from the given data sample: the proportion of waterfront houses among all houses. We obtained 0.0075, which means that about \(0.75\%\) of the house transactions in the dataset are waterfront houses.

observed_statistic_2 <- master_housing %>%
  specify(response = waterfront, success = "1") %>%
  calculate(stat = "prop")

observed_statistic_2
## Response: waterfront (factor)
## # A tibble: 1 × 1
##      stat
##     <dbl>
## 1 0.00749

3.2.3.4 Step 4. Calculate the p-value

  • We calculate the p-value by determining the proportion of simulated sample proportions in the null distribution that are greater than or equal to the observed sample proportion (0.0075).
p_value <- null_distribution_2 %>%
  get_p_value(obs_stat = observed_statistic_2$stat, direction = "greater")

p_value
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.138

Then, we add up the visualization of the observed statistic to the null distribution:

null_distribution_2 %>%
  visualize() +
  shade_p_value(obs_stat = observed_statistic_2$stat, color = "darkred", direction = "greater") + theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold", family = "Times", color = "darkred"),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10)) +
  scale_fill_manual(values = c("blue"))

Figure 3.2.3.4. Null Distribution of Proportion of Houses having Waterfront view with Observed Statistic.

3.2.3.5 Step 5. Interpretation - Statistical Conclusion

  • ‘Fail to reject the null hypothesis: There is not enough evidence to conclude that the proportion of waterfront homes in the sample is significantly different from the hypothesized 0.6%.’
  • From the test, a p-value of 0.138 means there is a 13.8% chance of observing a sample proportion of waterfront houses as extreme as, or more extreme than, the observed statistic (0.0075), assuming the null hypothesis is true.
  • As we chose the significance level \(\alpha = 0.05\), the p-value of 0.138 is considerably higher than \(\alpha\). Therefore, we do not have enough evidence to reject the null hypothesis, and the observed proportion of waterfront houses is not significantly higher than the hypothesized proportion of 0.006.
# Set the significance level
alpha <- 0.05

# Test conclusion
if (p_value$p_value < alpha) {
  conclusion <- "Reject the null hypothesis: The proportion of waterfront homes in the sample is significantly higher than the hypothesized 0.6%"
} else {
  conclusion <- "Fail to reject the null hypothesis: There is not enough evidence to conclude that the proportion of waterfront homes in the sample is significantly different from the hypothesized 0.6%."
}

# Display the conclusion
conclusion
## [1] "Fail to reject the null hypothesis: There is not enough evidence to conclude that the proportion of waterfront homes in the sample is significantly different from the hypothesized 0.6%."

3.2.3.6 Step 6: Further test with bootstrap confidence interval of values of test statistics

  • For this step, we generated a bootstrap distribution for the sample proportion of waterfront houses by resampling the data 10,000 times.

  • The resulting \(95\%\) confidence interval is \([0.0051, 0.0101]\), with a lower bound of 0.0051 and an upper bound of 0.0101. This means the true proportion of waterfront houses is likely to fall within this range.

  • Since \(p = 0.006\) falls within this range, 0.006 is a plausible value for the true proportion. Accordingly, we still do not have enough evidence to reject the null hypothesis, as the observed data do not show a significant difference from the hypothesized value.

# Generate bootstrap distribution for one proportion
boot_distn_one_prop <- master_housing %>%
  specify(response = waterfront, success = "1") %>%
  generate(reps = 10000, type = "bootstrap") %>%
  calculate(stat = "prop")

# Calculate the confidence interval
ci_2 <- boot_distn_one_prop %>%
  get_ci()

# Display the confidence interval
ci_2
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1  0.00507   0.0101
ggplot(boot_distn_one_prop, aes(x = stat))  +
  geom_histogram(binwidth = 0.001, fill = "#81bfda", color = "black", alpha = 0.7) +
  geom_vline(aes(xintercept = ci_2$lower_ci), color = "#E38E49", linetype = "solid", size = 2) +
  geom_vline(aes(xintercept = ci_2$upper_ci), color = "#E38E49", linetype = "solid", size = 2) +
  geom_text(aes(x = ci_2$lower_ci, y = Inf, label = "Lower CI"), color = "#E38E49", vjust = -0.5, hjust = 1.1) +
  geom_text(aes(x = ci_2$upper_ci, y = Inf, label = "Upper CI"), color = "#E38E49", vjust = -0.5, hjust = -0.1) +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Simulation-based Bootstrap Distribution",
       x = "proportion",
       y = "count", size = 14) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold", family = "Times", color = "darkblue"),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12),
        panel.grid.major = element_line(color = "gray80"),
        panel.grid.minor = element_line(color = "gray90")) +
  annotate("rect", xmin = ci_2$lower_ci, xmax = ci_2$upper_ci, ymin = -Inf, ymax = Inf, alpha = 0.3, fill = "#E38E49")

Figure 3.2.3.6. Bootstrap Distribution of Proportion of Houses having Waterfront view with Confidence Interval.

3.2.4 Result and Discussion

3.2.4.1 Summarise the results

  • To summarize, the p-value of our test (0.138) is higher than the significance level (0.05).
  • As for the bootstrap confidence interval, the hypothesized proportion \(p = 0.006\) still lies within the \(95\%\) confidence interval \([0.0051, 0.0101]\).
  • This means that the observed data are consistent with the null hypothesis, and there is no statistically significant difference between the observed proportion of waterfront homes and the hypothesized proportion of \(0.6\%\).

3.2.4.2 Explanation - Potential reasons

  • The failure to reject the null hypothesis, meaning that the proportion of waterfront houses among all houses is not significantly higher than 0.6%, implies that demand for waterfront houses in the USA might not be as high as anticipated, even in the summer. This result can be attributed to:
    • Socioeconomic factors: the high cost of waterfront houses can hinder purchases even though these houses bring benefits and are highly preferred; they are often out of reach for many buyers. The median house price in Washington, D.C. in 2014 was \$405,750 (Orton, 2014), whereas the median waterfront house price in our dataset is roughly double that. \(\Rightarrow\) Low effective demand: people want to buy, but the price is too high.
    • Complex environmental regulation: “In addition to environmental dangers, there are legal complexities in owning waterfront properties that must be carefully considered before buying.” (Soden, 2024)
  • Strict regulations aimed at protecting coastal ecosystems and wetlands can limit the availability of buildable land near water bodies. Development in those areas may therefore be restricted, leading to a low supply of waterfront houses due to scarcity.
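The price comparison cited above can be checked directly against the dataset; a short sketch (assuming the price and waterfront columns described in Section 2.1, with master_housing loaded as in Section 2.3):

```r
library(dplyr)

# Median sale price and transaction count by waterfront status
# (requires master_housing from Section 2.3)
master_housing %>%
  group_by(waterfront) %>%
  summarise(median_price   = median(price),
            n_transactions = n(),
            .groups = "drop")
```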

3.2.4.3 Limitations

  • Lack of reference sources: The null proportion of 0.6% might not be the most appropriate for this context, since our dataset covers the Seattle metropolitan area while the 0.6% figure refers to the USA as a whole. A different null hypothesis could be more suitable, depending on available information and expert knowledge.

  • Lack of data representation: The data cover only cities within the Seattle metropolitan area, which might not be representative of the entire population of transactions in terms of geographic and socioeconomic factors.

3.2.4.4 Suggestions

3.2.4.4.1 For buyers:

Waterfront houses can be extremely expensive, so buyers should be realistic about their budget and consider the high associated costs. Buying a waterfront house can bring unique experiences, but limited availability, low supply, and environmental issues can be significant shortcomings.

3.2.4.4.2 For sellers:
  • Sellers who have waterfront houses might need to highlight the premium and the scarcity of houses as unique features of houses with waterfront to attract customers.
  • Sellers should also adjust the pricing strategies according to the market trend to gain profit.
  • Sellers should also ensure compliance with law and environmental regulations to avoid legal issues.
3.2.4.4.3 For real estate brokers:
  • Brokers should educate clients about the limited availability of waterfront houses and how it can lead to higher prices, as well as the associated legal and environmental risks.
  • Also, they can benefit from the scarcity of waterfront houses as a premium feature to attract wealthy customers for investment.
  • Most importantly, brokers should diversify their product portfolios and preferably put a focus on houses without waterfront as more than 99% of transactions are houses without waterfront.
3.2.4.4.4 For future research:
  • Refining the hypothesis: more effort should go into finding an appropriate reference proportion to make the hypothesis test more meaningful. The proportion used here is from 2018, four years after the data were collected, so confounding factors may affect the result.
  • Deeper analysis of the characteristics of waterfront houses: explore geographic variation in the distribution and pricing of waterfront properties - which states have a higher concentration of water bodies, and therefore more waterfront houses?

3.3 Hypothesis III (Hypothesis test for dependence between two categorical variables): The condition of a house (ratings from 1 to 5) is independent of the house’s age group.

3.3.1 Motivation

Housing age often influences maintenance requirements, safety, and market value, making it a crucial factor for investors to consider when allocating funds to real estate (FasterCapital, n.d.). Some people believe that older houses are in worse condition than newer ones. We use the given data to test whether there is a dependence between a house’s condition and its age group. By testing the relationship between houses’ age groups (three age groups) and their conditions, we can determine whether a house’s age is a significant factor in predicting the condition of a property or whether other variables, such as location or socioeconomic factors, play a more prominent role. The findings can provide valuable insights to guide investment strategies for both local and international investors, including those from Vietnam, looking to navigate the US housing market effectively.

3.3.2 Hypothesis statement

3.3.2.1 Hypothesis

  • Null hypothesis (\(H_0\)): The condition of a house (ratings from 1 to 5) is independent of the house’s age group.
  • Alternative hypothesis (\(H_a\)): The condition of a house depends on its age group (the intuition being that older houses tend to be in worse condition).
  • Significance level: \(\alpha = 0.05\)

3.3.2.2 Data source and structure

  • The test will use two columns: condition and age_group.
  • The age_group column was derived by extracting the year from the date column and calculating the difference between this year and the yr_built column.
  • From the resulting age values, we divide the houses into three groups (in years): 0-33, 34-68, and 69-114.
  • Age group grouping method: We categorized housing ages into groups based on architectural styles and shared characteristics after identifying similarities in their design and construction trends (Forbes Home, n.d.).
    • 0-33 group (yr. built: 1981-2014 - Modern Construction): These homes have benefited from advancements in building codes, modern materials, and enhanced energy efficiency standards. Suburban and urban developments during this period introduced a variety of contemporary architectural styles, prioritizing open floor plans and sustainable design elements. Common styles include Contemporary, Neo-Colonial, Millennium Mansions, and Modern Farmhouses, reflecting evolving preferences for functionality and aesthetic appeal.
    • 34-68 group (yr. built: 1946-1980 - Post-War Boom Era): Housing during this period expanded dramatically due to economic growth after World War II, driven by suburbanization and the baby boom. Styles include Minimal Traditional, Mid-Century Modern, Split-Level, and Ranch homes, reflecting functional designs optimized for growing families.
    • 69-114 group (yr. built: 1900-1945 - Historic Era): These homes are often considered “historic,” featuring architectural styles like Queen Anne, Prairie, Colonial Revival, and Craftsman Bungalow. They are valued for character and historical significance but may require significant updates to meet modern standards.
# Add age_group column to master_housing based on the age of houses
master_housing <- master_housing %>%
  mutate(age = 2014 - yr_built,
         age_group = case_when(
           age <= 33 ~ "0-33",
           age <= 68 ~ "34-68",
           TRUE ~ "69-114"
         ))

# Display the updated master_housing dataset
master_housing
## # A tibble: 4,140 × 20
##    date                  price bedrooms bathrooms sqft_living sqft_lot floors
##    <dttm>                <dbl>    <dbl>     <dbl>       <dbl>    <dbl>  <dbl>
##  1 2014-05-09 00:00:00  376000        3      2           1340     1384    3  
##  2 2014-05-09 00:00:00  800000        4      3.25        3540   159430    2  
##  3 2014-05-09 00:00:00 2238888        5      6.5         7270   130017    2  
##  4 2014-05-09 00:00:00  324000        3      2.25         998      904    2  
##  5 2014-05-10 00:00:00  549900        5      2.75        3060     7015    1  
##  6 2014-05-10 00:00:00  320000        3      2.5         2130     6969    2  
##  7 2014-05-10 00:00:00  875000        4      2           2520     6000    1  
##  8 2014-05-10 00:00:00  265000        4      1           1940     9533    1  
##  9 2014-05-10 00:00:00  394950        3      2.5         1350     1250    3  
## 10 2014-05-11 00:00:00  842500        4      2.5         2160     5298    2.5
## # ℹ 4,130 more rows
## # ℹ 13 more variables: waterfront <fct>, view <dbl>, condition <dbl>,
## #   sqft_above <dbl>, sqft_basement <dbl>, yr_built <dbl>, yr_renovated <dbl>,
## #   street <chr>, city <chr>, statezip <chr>, country <chr>, age <dbl>,
## #   age_group <chr>

3.3.3 Data processing

3.3.3.1 Step 1. State the test statistics and testing method

  • Test statistic: the Chi-square test statistic \(\chi^2\), denoted \(T\) below

The Chi-square test statistic was chosen because it can measure how much the observed frequencies for condition and house age group differ from the expected frequencies under the null hypothesis.

  • Testing method: Chi-square Test of Independence

This statistical test evaluates the association between condition and age_group in the actual dataset and serves as a reference point for comparison with the null distribution. Using the Chi-square test of independence, we test whether housing age is associated with the condition of homes. For instance, older homes might more frequently fall into the “level 1” condition category due to outdated materials and aging infrastructure, while newer homes might dominate the “level 5” category because of modern construction standards and materials. The test statistic follows the equation below:

\[T = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]

  • \(E_{ij}\) is the expected count in the table cell at row \(i\), column \(j\) if there is no dependence between house condition and house age group.
  • \(O_{ij}\) is the observed count in the table cell at row \(i\), column \(j\).
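As a quick numerical illustration of the formula (using a toy 2×3 table, not the housing data), the expected counts and the statistic can be computed in base R and cross-checked against chisq.test():

```r
# Toy observed table (hypothetical counts)
O <- matrix(c(30, 20, 10,
              15, 25, 20), nrow = 2, byrow = TRUE)

# Expected counts under independence: E_ij = (row i total) * (col j total) / n
E <- outer(rowSums(O), colSums(O)) / sum(O)

# Chi-square statistic: T = sum over cells of (O - E)^2 / E
T_stat <- sum((O - E)^2 / E)
T_stat  # 8.8889 for this toy table

# Cross-check with base R (no continuity correction is applied for tables larger than 2x2)
all.equal(T_stat, unname(chisq.test(O)$statistic))
```

The same arithmetic, applied to the observed and expected tables built later in this section, yields the reported statistic.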
# Load the janitor library, which provides tabyl() and adorn_totals()
library(janitor)

# Create a table of house conditions and age groups
condition_age_table <- master_housing %>%
  tabyl(condition, age_group) %>%
  adorn_totals(where = c("row", "col"))

# Display the table
condition_age_table
##  condition 0-33 34-68 69-114 Total
##          1    0     1      4     5
##          2    1    15     11    27
##          3 1587   665    344  2596
##          4  184   652    278  1114
##          5   21   187    190   398
##      Total 1793  1520    827  4140
# Specify the null hypothesis
null_hypothesis3 <-  master_housing %>%
  specify(age_group ~ condition) %>%
  hypothesize(null = "independence")

3.3.3.2 Step 2. Graph the null distribution

  • Null distribution is the distribution of test statistics when the null hypothesis is true.
  • The null distribution approximately follows a chi-square distribution. We constructed it for the chi-square statistic under the null hypothesis of independence between condition and age_group by generating 10,000 permuted datasets, shuffling the age_group variable while keeping the condition variable fixed. This simulates the distribution of the test statistic under the null hypothesis.
    • X-axis: the range of values for the test statistic under the null hypothesis.
    • Y-axis: the frequency of observations.

The distribution is right-skewed: most values of the test statistic cluster near the lower end, with fewer observations at larger values.

one_null_sample3 <- null_hypothesis3 %>%
  generate(reps = 1, type = "permute") 

(O_t1=table(one_null_sample3$age_group, one_null_sample3$condition))
##         
##             1    2    3    4    5
##   0-33      2   12 1087  517  175
##   34-68     2    9  970  393  146
##   69-114    1    6  539  204   77
sum((O_t1-E_t)^2/E_t)
## [1] 7.289755
null_hypothesis3$condition <- as.factor(null_hypothesis3$condition)
null_hypothesis3$age_group <- as.factor(null_hypothesis3$age_group)

null_distribution3 = null_hypothesis3 %>%
  specify(response = age_group, explanatory = condition) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 10000, type = "permute") %>%
  calculate(stat = "Chisq")
ggplotly(ggplot(null_distribution3, aes(x = stat)) +
  geom_histogram(binwidth = 2, fill = "#81bfda", color = "black", alpha = 0.7) +
  labs(title = "Histogram of the Null Distribution",
       x = "Chi-square test statistic",
       y = "Frequency") +
  theme_minimal() + theme(plot.title = element_text(hjust = 0.5, size = 18, color = "dark blue", face = "bold", family = "serif")))

Figure 3.3.3.2. Histogram of the Null Distribution of Chi Square Test Statistic

3.3.3.3 Step 3. Calculate the observed statistic

  • Build the observed count table (\(O_t\)): cross-tabulate the counts of observations for each combination of age_group (rows) and condition (columns). The resulting table offers a clear view of how house conditions are distributed across age groups.
(O_t=table(master_housing$age_group, master_housing$condition))
##         
##             1    2    3    4    5
##   0-33      0    1 1587  184   21
##   34-68     1   15  665  652  187
##   69-114    4   11  344  278  190
  • Build the expected count table (\(E_t\)): generate a table of expected frequencies under the assumption of independence between age_group and condition. These values are compared with the observed frequencies to calculate the chi-square statistic and determine whether an association exists between the variables.
expectedIndependent = function(X) {
     n = sum(X)
     p = rowSums(X)/sum(X)
     q = colSums(X)/sum(X)
     return(p %o% q * n) # outer product creates table
}
(E_t=expectedIndependent(table(master_housing$age_group,master_housing$condition)))
##                1         2         3        4         5
## 0-33   2.1654589 11.693478 1124.3063 482.4643 172.37053
## 34-68  1.8357488  9.913043  953.1208 409.0048 146.12560
## 69-114 0.9987923  5.393478  518.5729 222.5309  79.50386
  • The Chi-square test statistic (formula above) is calculated with the code below; in this case, the observed statistic is 1006.825.
observed_stat3 <- sum((O_t-E_t)^2/E_t)

3.3.3.4 Step 4. Calculate the p-value

  • P-value is the probability of seeing a test statistic as extreme or more extreme than the observed statistic. In this test, the p-value is likely to be close to 0.
(p_value3 <- null_distribution3 %>%
  get_p_value(obs_stat = observed_stat3, direction =  "greater"))
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0
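Under the hood, get_p_value() with direction = "greater" is simply the proportion of simulated null statistics at least as large as the observed one. A base-R sketch on synthetic draws (using a \(\chi^2_8\) stand-in, since the 3×5 table here has \((3-1)(5-1)=8\) degrees of freedom; the numbers are illustrative, not the real null distribution):

```r
set.seed(1)
# Stand-in for a simulated null distribution: 10,000 chi-square(df = 8) draws
null_stats <- rchisq(10000, df = 8)

obs <- 20  # hypothetical observed statistic

# One-sided ("greater") permutation p-value: proportion as or more extreme
p_value <- mean(null_stats >= obs)
```

With an observed statistic above 1000, essentially no simulated value is as extreme, which is why the reported p-value is 0.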
  • The graph of the null distribution represents the expected distribution of test statistics under the assumption that the null hypothesis is true and includes the observed test statistic (red line). In this case, the observed test statistic is far to the right of the null distribution, indicating it is much larger than what we would expect under the null hypothesis.
# Calculate the observed chi-squared statistic
observed_stat3 <- data.frame(stat = sum((O_t - E_t)^2 / E_t))

graph3 <- ggplot(null_distribution3, aes(x = stat)) +
  geom_histogram(binwidth = 5, col="blue", fill = "blue", alpha = 0.7, boundary = 0) +
  geom_vline(aes(xintercept = observed_stat3$stat), color = "darkred", linetype = "dashed", size = 1) +
  labs(title = "Histogram of Null Distribution with Observed Statistic",
       x = "Test statistics",  
       y = "Frequency") + 
  theme(plot.title = element_text(hjust = 0.5, size = 18, color = "dark blue", face = "bold", family = "serif")) +
  theme_minimal()

ggplotly(graph3)

Figure 3.3.3.4. Histogram of Null Distribution with Observed Statistic.

3.3.3.5 Step 5. Interpretation - Statistical Conclusion

  • Since the \(p_{value} = 0 < \alpha = 0.05\), the null hypothesis is rejected.

  • The rejection of the null hypothesis implies that there is statistical evidence to support the alternative hypothesis that the condition of a house is dependent on its age group.

  • In other words, the condition ratings (1 to 5) vary across different house age groups, suggesting that older or newer houses might have distinct condition profiles. This dependency might reflect factors such as maintenance trends, construction standards, or wear over time.

# Set the significance level
alpha <- 0.05

# Test conclusion
if (p_value3$p_value < alpha) {
  conclusion3 <- "Reject the null hypothesis: The condition of a house depends on its age group."
} else {
  conclusion3 <- "Fail to reject the null hypothesis: There is insufficient evidence that the condition of a house depends on its age group."
}

# Display the conclusion
conclusion3
## [1] "Reject the null hypothesis: The condition of a house depends on its age group."

3.3.3.6 Step 6. Further test with bootstrap confidence interval of values of test statistics

  • The bootstrap method is a statistical tool for estimating confidence intervals and assessing the stability of observed results. By resampling the original dataset 1,000 times with replacement, we generate a distribution that approximates the sampling variability of the observed test statistic.

  • From the bootstrap distribution, we calculated a 95% confidence interval, including the lower and upper bounds for the test statistic. The comparison between the observed test statistic and the null distribution provides a basis for evaluating whether the observed data aligns with the null hypothesis.

  • When the entire \(95\%\) confidence interval for the test statistic lies far above the values typical under the null distribution, it strongly suggests that the observed effect is unlikely to have occurred by random chance under the null hypothesis, providing strong evidence supporting the alternative hypothesis. Although the interval varies slightly with each bootstrap run, it consistently provided statistically significant evidence to reject the null hypothesis. The results support a relationship between the condition of a house and its age group and highlight the value of bootstrap-based inference for reliable conclusions.
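The percentile interval returned by get_ci() defaults to the middle 95% of the bootstrap statistics; a base-R sketch with synthetic bootstrap draws (the mean and spread are hypothetical, chosen only to resemble the interval reported below):

```r
set.seed(42)
# Stand-in for 1,000 bootstrap replicates of the chi-square statistic
boot_stats <- rnorm(1000, mean = 1000, sd = 50)

# 95% percentile confidence interval: the 2.5th and 97.5th quantiles
ci <- quantile(boot_stats, probs = c(0.025, 0.975))
ci
```

The design choice here is the percentile method: no distributional assumption is made about the statistic beyond what the resamples themselves show.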

master_housing$age_group <- as.factor(master_housing$age_group)
master_housing$condition <- as.factor(master_housing$condition)

boot_distn3 <- master_housing %>% 
  specify(age_group ~ condition) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "Chisq")
ci_boot3 <-boot_distn3  %>% 
  get_ci()
ci_boot3
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     912.    1121.
ggplot(boot_distn3, aes(x = stat))  +
  geom_histogram(binwidth = 20, fill = "#81bfda", color = "black", alpha = 0.7) +
  geom_vline(aes(xintercept = ci_boot3$lower_ci), color = "#E38E49", linetype = "solid", size = 2) +
  geom_vline(aes(xintercept = ci_boot3$upper_ci), color = "#E38E49", linetype = "solid", size = 2) +
  geom_text(aes(x = ci_boot3$lower_ci, y = Inf, label = "Lower CI"), color = "#E38E49", vjust = -0.5, hjust = 1.1) +
  geom_text(aes(x = ci_boot3$upper_ci, y = Inf, label = "Upper CI"), color = "#E38E49", vjust = -0.5, hjust = -0.1) +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Simulation-based Bootstrap Distribution",
       x = "Test Statistics",
       y = "count", size = 14) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold", family = "Times", color = "darkblue"),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12),
        panel.grid.major = element_line(color = "gray80"),
        panel.grid.minor = element_line(color = "gray90")) +
  annotate("rect", xmin = ci_boot3$lower_ci, xmax = ci_boot3$upper_ci, ymin = -Inf, ymax = Inf, alpha = 0.3, fill = "#E38E49")

Figure 3.3.3.5.: Simulation-based Bootstrap Distribution (Hypothesis 3).

3.3.4 Result and Discussion

3.3.4.1 Summarize the result

  • The Chi-square test of independence was conducted to determine whether there is a relationship between house age and house condition. The observed Chi-square statistic was 1006.8248. The p-value was calculated to be extremely small (essentially 0), which is much lower than the typical significance level of 0.05.
  • As \(p_{value} < 0.05\), we reject the null hypothesis and conclude that the condition of a house (ratings from 1 to 5) depends on the house’s age group.

3.3.4.2 Explanation - Potential reasons

  • The result rejects the null hypothesis that the condition of a house (ratings from 1 to 5) is independent of the house’s age group, meaning that the age group of a house has an impact on its condition rating. This result might indicate some patterns such as:
    • Aging and deterioration of the house: As a house ages, the probability of deterioration may increase. Older houses may require more maintenance relating to key components including settling foundations, outdated wiring, aging plumbing systems, faulty carbon monoxide detectors, or worn-out roofing (Today’s Homeowner, n.d.). These requirements may increase older houses’ long-term costs compared to newer properties.
    • Renovations in older neighborhoods: In neighborhoods with renovation, older homes are often extensively renovated to cater to the preferences of higher-income buyers. These renovations improve both the condition and market value of the properties, which can lead to higher condition ratings despite the homes’ original age. The arrival of wealthier residents typically drives more investment in maintaining and upgrading older houses (Fingers, 2024).

3.3.4.3 Limitations

Result association, not causation: Although the test result shows a relationship between the age group of houses and their condition, this does not imply a causal connection. Our test considers only the chronological age of the house, not the effective age, which reflects the actual condition of the property after renovations and maintenance. With nearly 41% of houses in the dataset having been renovated, the chronological age alone may not accurately represent the true state of these properties.

In addition, other factors, such as renovation quality or maintenance frequency, may also influence the results, and these are not accounted for in the test.

3.3.4.4 Suggestions

For buyers and investors: Buyers and investors should consider both the chronological age and the effective age. While chronological age indicates how long the house has existed, effective age reflects the true remaining life of a property, accounting for the typical life expectancy of such a building as well as its use (CoreLogic, n.d.). A well-renovated older house may be in better condition than a newer one that hasn’t been maintained properly. It is important to always inspect the quality of renovations, assess maintenance records, and evaluate how these factors affect the house’s long-term value and potential costs.

3.4 Hypothesis 4 (Hypothesis test for dependence between a numerical and a categorical variable): There is no significant difference in the mean price between renovated houses and non-renovated ones

3.4.1 Motivation

Renovations are often believed to significantly boost a home’s market value, with improvements like modernized kitchens, upgraded bathrooms, and added square footage linked to price premiums. Supporting this, a Ph.D. thesis from the University of Padova found that structural and energy-efficient renovations enhance property value by improving physical attributes and urban appeal (Xu, 2022). Similarly, Cambridge University researchers reported that homes with high Energy Performance Certificate (EPC) ratings sold for up to 14% more, highlighting the market value of energy-efficient upgrades (Fuerst et al., 2013).

One frequent inquiry is whether renovated homes really sell for more than non-renovated ones. This study aims to investigate this question for the Seattle Metropolitan Area by testing the null hypothesis that there is no significant difference in the average prices of renovated and non-renovated homes.

The motivation for this analysis is to provide house sellers and buyers with evidence-based insights into the financial impact of renovations, helping them make informed decisions by analyzing 2014 transaction data for houses built between 1900 and 2014 in the Seattle Metropolitan Area.

3.4.2 Hypothesis statement

\(\mu_R\): Mean price of renovated houses.

\(\mu_N\): Mean price of non-renovated houses.

  • Null hypothesis (\(H_0\)): There is no significant difference in the mean price between renovated houses and non-renovated ones
    • In symbol: \(H_0: \mu_R = \mu_N\)
  • Alternative hypothesis (\(H_\alpha\)): There is a significant difference in the mean price between renovated houses and non-renovated ones
    • In symbol: \(H_a: \mu_R \neq \mu_N\)
  • Significance level: \(\alpha = 0.05\)

3.4.3 Data processing

3.4.3.1 Data extraction and Grouping before hypothesis testing

This step extracts key columns (price (house price), yr_renovated (the year the house was renovated), yr_built (the year the house was originally built), and sqft_living (living area in square feet)) to create a new dataset. The data is then grouped by home size using intervals of 300 square feet, and the most popular size group is selected for hypothesis testing. This ensures that comparisons are made between houses of similar size, minimizing confounding effects and enhancing the validity of the results.

# Load the necessary library
library(dplyr)

# Divide the price by 1,000 to express it in thousands of USD (KUSD)
master_housing <- master_housing %>%
  mutate(price = price / 1000)
# Round the value in price to 1 decimal place
master_housing <- master_housing %>%
  mutate(price = round(price, 1))
# Display the updated data frame
master_housing

# Extract the specified columns from the master_housing_city dataframe
data4 <- master_housing %>%
  select(price, yr_renovated, yr_built, sqft_living)
data4

# Create a new column 'Sqft_group' with integer labels for the intervals of 300 feet
data4 <- data4 %>%
  mutate(Sqft_group = cut(sqft_living, 
                          breaks = seq(0, max(sqft_living, na.rm = TRUE) + 300, by = 300), 
                          include.lowest = TRUE, 
                          labels = seq(300, max(sqft_living, na.rm = TRUE) + 300, by = 300)))

head(data4,5)

# Count the rows of each value in column Sqft_group
sqft_group_counts <- data4 %>%
  group_by(Sqft_group) %>%
  summarise(count = n())

# Find the largest count
max_count <- max(sqft_group_counts$count)

# Display the results, just top rows
head(sqft_group_counts, 10)
max_count

# Create the dataset named "data5" from data4 with the rows having sqft_living in (1800, 2100] only
data5 <- data4 %>%
  filter(Sqft_group ==2100)
# Display the first 5 rows of data5
data5

# Create a new table with three columns: yr_built, price, and status (renovated or newly built)
price_table <- data5 %>%
  mutate(status = ifelse(yr_renovated > 0, "renovated", "non-renovated")) %>%
  select(yr_built, price, status) %>%
  filter(!is.na(price) & price > 0)

head(price_table)
# Load the necessary library
library(infer)

# Specify the null hypothesis that the price of non-renovated houses and renovated houses is not different
null_hypothesis_4 <- price_table %>%
  specify(price ~ status) %>%
  hypothesize(null = "independence")

3.4.3.2 Step 1. State the test statistic

  • The test statistic is the difference between the two sample means, \(\bar{x}_R - \bar{x}_N\).
  • The null distribution is the distribution of the test statistic when the null hypothesis is true.
# Generate one permuted sample under the null hypothesis
one_null_sample <- null_hypothesis_4 %>%
  generate(reps = 1, type = "permute") 

one_null_sample
## Response: price (numeric)
## Explanatory: status (factor)
## Null Hypothesis: independence
## # A tibble: 591 × 3
## # Groups:   replicate [1]
##    price status        replicate
##    <dbl> <fct>             <int>
##  1  325. renovated             1
##  2  266. non-renovated         1
##  3  290  renovated             1
##  4  392  renovated             1
##  5  249  non-renovated         1
##  6  505  renovated             1
##  7  400  renovated             1
##  8  446. non-renovated         1
##  9  520. non-renovated         1
## 10  288. renovated             1
## # ℹ 581 more rows
one_null_sample %>%
  group_by(status) %>%
  summarize(mean_price = mean(price))
## # A tibble: 2 × 2
##   status        mean_price
##   <fct>              <dbl>
## 1 non-renovated       449.
## 2 renovated           476.
# Calculate the difference in sample means for this permuted sample
one_null_sample %>%
  calculate(stat = "diff in means", order = c("renovated", "non-renovated"))
## Response: price (numeric)
## Explanatory: status (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  26.1
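What infer’s type = "permute" does at each replicate can be sketched in base R: shuffle the status labels and recompute the difference in means. A toy example with hypothetical prices (not rows from the dataset):

```r
set.seed(7)
# Hypothetical prices (KUSD) and renovation status for a small sample
price  <- c(320, 450, 510, 280, 390, 600, 350, 420)
status <- c("renovated", "renovated", "renovated", "non-renovated",
            "non-renovated", "non-renovated", "non-renovated", "renovated")

# Difference in sample means: renovated minus non-renovated
diff_means <- function(p, s) {
  mean(p[s == "renovated"]) - mean(p[s == "non-renovated"])
}

obs_diff <- diff_means(price, status)  # 20 for this toy sample

# One permutation replicate: shuffle labels, breaking any price-status link
perm_diff <- diff_means(price, sample(status))
c(observed = obs_diff, permuted = perm_diff)
```

Repeating the shuffle many times and collecting perm_diff values produces exactly the kind of null distribution graphed in the next step.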

3.4.3.3 Step 2. Graph the null distribution

With 10,000 repetitions, the “generate” function creates a distribution of differences in means assuming the null hypothesis is true (i.e., there is no difference between the two groups).

The ggplot function plots the null distribution as a histogram. Each bar represents the frequency of a specific difference in means that could be observed under the null hypothesis.

# Generate the null distribution
null_distribution_4 <- null_hypothesis_4 %>%
   generate(reps = 10000, type = "permute") %>%
calculate(stat = "diff in means", order = c("renovated", "non-renovated"))
null_distribution_4
## Response: price (numeric)
## Explanatory: status (factor)
## Null Hypothesis: independence
## # A tibble: 10,000 × 2
##    replicate   stat
##        <int>  <dbl>
##  1         1 -20.9 
##  2         2   5.54
##  3         3 -43.7 
##  4         4  15.2 
##  5         5   7.06
##  6         6   3.24
##  7         7 -13.5 
##  8         8 -16.5 
##  9         9  -7.41
## 10        10 -19.6 
## # ℹ 9,990 more rows
library(plotly)
ggplotly(ggplot(null_distribution_4, aes(x = stat)) +
  geom_histogram(binwidth = 10, fill = "#81bfda", color = "black", alpha = 0.7) +
  labs(title = "Histogram of the Null Distribution",
       x = "Difference in means (Renovated - Non-renovated House Price, KUSD)",
       y = "Frequency") +
  theme_minimal() + theme(plot.title = element_text(hjust = 0.5, size = 18, color = "dark blue", face = "bold", family = "serif")))

Figure 3.4.3.2. Null distribution of Difference in means (Renovated house price - Non-renovated house price in KUSD)


3.4.3.4 Step 3. Calculate the observed data

This step calculates the observed difference in means between renovated and non-renovated homes from the actual data. The observed statistic is then compared against the null distribution to assess whether the difference is statistically significant.

# Calculate the observed statistic
observed_stat_4 <- price_table %>%
  specify(price ~ status) %>%
  calculate(stat = "diff in means", order = c("renovated", "non-renovated"))
observed_stat_4
## Response: price (numeric)
## Explanatory: status (factor)
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  11.2

3.4.3.5 Step 4. Calculate the p-value

\(p_{value}\) is calculated by comparing the observed statistic (observed_stat_4) to the null distribution (null_distribution_4). The p-value assesses the likelihood of obtaining a result as extreme as the observed statistic under the null hypothesis.

# Calculate the p-value
p_value_4 <- null_distribution_4 %>%
  get_p_value(obs_stat = observed_stat_4, direction = "both")

p_value_4
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.506
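With direction = "both", get_p_value() counts simulated statistics at least as extreme as the observed one in either tail; for a roughly symmetric null this equals the proportion whose absolute value exceeds the observed absolute value. A base-R sketch on synthetic numbers (the spread is hypothetical, not the real null distribution):

```r
set.seed(3)
# Stand-in for a permutation null distribution of the difference in means (KUSD)
null_stats <- rnorm(10000, mean = 0, sd = 15)

obs <- 11.2  # the observed difference reported above

# Two-sided p-value: proportion at least as extreme in absolute value
p_two_sided <- mean(abs(null_stats) >= abs(obs))
```

Because the observed difference sits well inside the bulk of the null distribution, a large two-sided p-value like the 0.506 above is unsurprising.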

We graphed the null distribution with the observed statistic below:

ggplotly(null_distribution_4 %>%
  visualize() +
  shade_p_value(obs_stat = observed_stat_4$stat, color = "darkred", direction = "both") + theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold", family = "Times", color = "darkred"),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10)) +
  scale_fill_manual(values = c("blue")) +
  labs(title = "Null Distribution of Differences in Price Means with Observed Statistic",
       x = "Differences in Price Means (KUSD)",
       y = "Frequency"))

Figure 3.4.3.4. Null distribution of Difference in means (Renovated house price - Non-renovated house price in KUSD) with Observed Statistic

3.4.3.6 Step 5. Interpretation - Statistical conclusion

  • \(p_{value} > \alpha = 0.05\)

  • The p-value obtained is greater than the significance level of 0.05. Therefore, we fail to reject the null hypothesis. This indicates that there is insufficient evidence to suggest a significant difference in the average prices of renovated and non-renovated homes in the Seattle Metropolitan Area. Consequently, the data does not support the claim that renovations lead to a statistically significant increase in home prices.

# Set the significance level
alpha <- 0.05

# Conclude the hypothesis testing result
if (p_value_4$p_value < alpha) {
  conclusion <- "Reject the null hypothesis: There is a significant difference in mean prices between renovated and non-renovated houses."
} else {
  conclusion <- "Fail to reject the null hypothesis: There is no significant difference in mean prices between renovated and non-renovated houses."
}

conclusion
## [1] "Fail to reject the null hypothesis: There is no significant difference in mean prices between renovated and non-renovated houses."

3.4.3.7 Step 6. Further test with bootstrap confidence interval of the test statistic

This step generates a bootstrap distribution for the test statistic, which is helpful to estimate the variability of the difference in means between renovated and non-renovated homes based on resampling from the observed data.

If the interval includes zero, it means there might be no real difference between the two groups. If zero is not in the range, it suggests there is a significant difference in prices.
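The “does the interval contain zero” check described above is a one-line comparison; a sketch with synthetic bootstrap draws whose center and spread loosely mimic the result below (all numbers hypothetical):

```r
set.seed(9)
# Stand-in bootstrap replicates of the difference in means (KUSD)
boot_stats <- rnorm(10000, mean = 11, sd = 17)

# 95% percentile interval and the zero-inclusion check
ci <- quantile(boot_stats, probs = c(0.025, 0.975))
zero_inside <- ci[[1]] <= 0 && 0 <= ci[[2]]
zero_inside  # TRUE here: consistent with no significant difference at the 5% level
```

When zero_inside is TRUE, the bootstrap evidence agrees with failing to reject the null hypothesis.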

# Generate the bootstrap distribution for the test statistic
# (resampling the observed data directly, without the null hypothesis)
bootstrap_distribution_4 <- price_table %>%
  specify(price ~ status) %>%
  generate(reps = 10000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("renovated", "non-renovated"))

# Calculate the confidence interval and store it in ci_4
ci_4 <- bootstrap_distribution_4 %>%
  get_ci()

ci_4
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    -21.1     45.0
ggplot(bootstrap_distribution_4, aes(x = stat)) +
  geom_histogram(binwidth = 10, fill = "#81bfda", color = "black", alpha = 0.7) +
  # Shade the confidence interval first so the boundary lines draw on top
  annotate("rect", xmin = ci_4$lower_ci, xmax = ci_4$upper_ci, ymin = -Inf, ymax = Inf, alpha = 0.3, fill = "#E38E49") +
  geom_vline(xintercept = ci_4$lower_ci, color = "#E38E49", linetype = "solid", linewidth = 2) +
  geom_vline(xintercept = ci_4$upper_ci, color = "#E38E49", linetype = "solid", linewidth = 2) +
  annotate("text", x = ci_4$lower_ci, y = Inf, label = "Lower CI", color = "#E38E49", vjust = 1.5, hjust = 1.1) +
  annotate("text", x = ci_4$upper_ci, y = Inf, label = "Upper CI", color = "#E38E49", vjust = 1.5, hjust = -0.1) +
  labs(title = "Simulation-based Bootstrap Distribution",
       x = "Difference in Means",
       y = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold", family = "Times", color = "darkblue"),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12),
        panel.grid.major = element_line(color = "gray80"),
        panel.grid.minor = element_line(color = "gray90"))
Figure 3.4.3.5. Simulation-based Bootstrap Distribution (Hypothesis 4)

The confidence interval ranges from -21.1 to 45.0 KUSD, indicating that the true difference in mean price between renovated and non-renovated homes could be as low as -21.1 or as high as 45.0 thousand USD.

Since this interval includes zero, it suggests there is no clear significant difference in price, supporting the conclusion that renovations may not have a substantial impact on home prices in the Seattle Metropolitan Area.
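The "zero in the interval" reasoning above can also be checked directly in code. This is a minimal sketch that assumes `ci_4` still holds the `lower_ci` and `upper_ci` columns computed earlier:

```r
# Check whether zero lies inside the bootstrap confidence interval:
# if it does, the data are consistent with no difference in mean prices.
contains_zero <- ci_4$lower_ci <= 0 & ci_4$upper_ci >= 0

if (contains_zero) {
  "The interval contains 0: no significant difference in mean prices."
} else {
  "The interval excludes 0: a significant difference in mean prices."
}
## [1] "The interval contains 0: no significant difference in mean prices."
```

This check agrees with the hypothesis test in Step 5: both the p-value comparison and the confidence interval lead to the same conclusion.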

3.4.4 Results and Discussion

3.4.4.1 Summarize the result

The hypothesis test compared the average prices of renovated and non-renovated homes in the Seattle Metropolitan Area. With a p-value of 0.4782, greater than the significance level of 0.05, we failed to reject the null hypothesis. The bootstrap confidence interval also included zero, indicating no significant difference in prices between the two groups. These results suggest that renovations do not significantly impact home prices in the area.

3.4.4.2 Explanation and Potential reasons

Several factors may explain this test result:

  • Home prices are influenced by many variables such as location, neighborhood desirability, and economic conditions, which might overshadow the impact of renovations.
  • The dataset does not specify the types of renovations. Different levels of scale and quality of the renovation will cost differently. For example, the cost of a garage remodel ranges from \(\$3,000\) to \(\$15,000\), while remodeling a kitchen can cost between \(\$10,000\) and \(\$50,000\) (Waldek, 2023).
  • Buyer preferences: Many buyers prefer non-renovated homes because they offer the opportunity to customize and renovate according to personal preferences and budget, which raises demand for these properties. Others may simply prefer the charm and character of older homes.

3.4.4.3 Limitations

The dataset covers the period from 1900 to 2014, so the results may not reflect current market trends. Additionally, construction and renovation costs have changed significantly over time. Major economic events, such as the 2008 financial crisis, may have also influenced and skewed the results.

3.4.4.4 Suggestions

3.4.4.4.1 For buyers

Since there is no significant difference between renovated and non-renovated houses, buyers can consider both options and focus on other important features such as square footage, waterfront location, number of floors, and overall potential. It’s advisable to prioritize location and long-term value rather than solely focusing on the renovation status.

3.4.4.4.2 For salesmen

Salespeople should emphasize the benefits of both renovated and non-renovated homes, helping buyers evaluate each option based on their needs. It’s important to educate buyers about the costs of renovations and the potential for customization in non-renovated homes, while also highlighting the convenience and modern features of renovated homes. Additionally, sales strategies should focus on other key property attributes such as location, square footage, and long-term potential. By doing this, salespeople can build trust in customers.

4 CONCLUSION

4.1 Summarize four Hypotheses

  • Hypothesis 1: Our analysis provides strong statistical evidence that the median house size in cities within the Seattle Metropolitan Area is smaller than 2,185 square feet. This is supported by a p-value of 0, leading to rejection of the null hypothesis, and a 95% confidence interval of 1,950 to 2,010 square feet, both below 2,185 square feet.

  • Hypothesis 2: The p-value of 0.122 (greater than \(\alpha = 0.05\)) and the hypothesized proportion (\(p_0 = 0.006\)) lying within the 95% bootstrap confidence interval \([0.0051, 0.0101]\) suggest that there is no significant difference between the observed proportion of waterfront homes and the hypothesized proportion of 0.6%. We fail to reject the null hypothesis.

  • Hypothesis 3: The Chi-square test of independence yielded a statistic of 1006.8248 and a very small p-value (essentially 0), which is much lower than the significance level of 0.05. As a result, we reject the null hypothesis and conclude that the house condition (ratings from 1 to 5) is dependent on the house’s age group.

  • Hypothesis 4: The p-value greater than the significance level of 0.05, and the bootstrap confidence interval including 0, suggest that there is no significant difference in the average prices between renovated and non-renovated homes. We fail to reject the null hypothesis, indicating that renovations do not significantly affect home prices in the Seattle Metropolitan Area.

4.2 Suggestions

4.2.1 For Buyers

  • Consider both the chronological and effective age of a property, as effective age reflects the true remaining life and condition of the home.

  • Be realistic about the budget, especially when considering waterfront properties, due to their high costs and limited availability.

  • Consider both renovated and non-renovated houses, because they are not significantly different in price. Furthermore, buyers should focus on key features like square footage, location, number of floors, and long-term potential, rather than just the renovation status.

4.2.2 For Sellers

  • Highlight the premium and scarcity of waterfront properties to attract buyers.

  • Adjust pricing strategies according to market trends to maximize profit.

  • Ensure compliance with legal and environmental regulations to avoid future issues.

4.2.3 For Real Estate Brokers and Salesmen

  • Educate clients about the limited availability and higher prices of waterfront properties, as well as the legal and environmental risks, while also leveraging the scarcity of waterfront homes as a premium feature to attract investment buyers.

  • Diversify the portfolio, focusing more on homes without waterfront views, as the majority of transactions involve such properties.

  • Emphasize the benefits of both renovated and non-renovated homes based on buyer preferences to build trust and meet customer needs.

5 REFERENCES

CoreLogic. (n.d.). Effective age versus actual age. Retrieved December 8, 2024, from https://www.corelogic.com/intelligence/effective-age-versus-actual-age/

FasterCapital. (n.d.). Factors affecting property value in real estate. Retrieved December 8, 2024, from https://fastercapital.com/topics/factors-affecting-property-value-in-real-estate.html

Forbes Home. (n.d.). House styles through the decades. Forbes. Retrieved December 8, 2024, from https://www.forbes.com/home-improvement/design/house-styles-through-decades/

Fox, J. (2024, July 25). Why real estate prices are so high in Seattle: A 2024 insight. The Madrona Group. https://www.themadronagroup.com/why-real-estate-prices-are-so-high-seattle/

Fuerst, F., McAllister, P., Nanda, A., & Wyatt, P. (2013, June 17). An investigation of the effect of EPC ratings on house prices. Department of Energy and Climate Change.

Gatea, M. (2021, June 23). Seattle metro lot size decreasing by over 30% over the past two decades. https://www.storagecafe.com/blog/seattle-lot-sizes-drop-to-20-year-lows/

NeoMam Studios. (2022, December 2). The median home size in every U.S. state in 2022. Visual Capitalist. https://www.visualcapitalist.com/cp/median-home-size-every-american-state-2022/

Orton, K. (2015, February 20). The D.C.-area housing market, decoded: A 2014 statistical breakdown by ZIP code. The Washington Post. https://www.washingtonpost.com/realestate/the-dc-area-housing-market-decoded-a-2014-statistical-breakdown-by-zip/2015/02/20/dbf57e76-b7af-11e4-a200-c008a01a6692_story.html

Rolling Out. (2024, August 24). The impact of gentrification on homeownership. Retrieved December 10, 2024, from https://rollingout.com/2024/08/24/impact-of-gentrification-homeownership/

Shalom, S. (2018, August 20). What size home should I buy? - coldwell banker blue matter blog. Coldwell Banker Blue Matter. https://blog.coldwellbanker.com/size-matters-finding-perfect-size-home/

Oliver L.E. Soden Agency. (2024, March 6). The hidden risks and liabilities of owning waterfront property. https://sodeninsurance.com/the-hidden-risks-and-liabilities-of-owning-waterfront-property/

Waldek, S. (2023, January 5). How much does it cost to renovate a house? Architectural Digest. https://www.architecturaldigest.com/story/cost-to-renovate-a-house

SuperAdmin. (2022, May 5). Is bigger better? pros & cons of larger homes by Normandy Homes. Normandy Homes. https://normandyhomes.com/lifestyle/is-bigger-better-pros-and-cons-of-buying-a-larger-home/

Today’s Homeowner. (n.d.). What is the median home age in the U.S.? Retrieved December 8, 2024, from https://todayshomeowner.com/home-finances/guides/median-home-age-us/

Xu, L. (2022). The impact of structural and energy-efficient renovations on property values (Doctoral dissertation). University of Padova. https://www.universityofpadova.com/thesis/impact_of_renovations